Hybrid Data x Context Parallelism Feature #2054
Conversation
dimapihtar left a comment:
LGTM from datasets perspective.
Move HybridCPDataLoaderWrapper & BalancedCPScheduler to core/datasets
This has been part of the possible improvements and feedback I have gotten as well. Duncan was reviewing this on GitLab before I moved it to GitHub, and as of our last discussion we agreed to wait for his review so we can finalize the required refactoring in one pass and avoid multiple rounds of refactoring and testing. I'll keep this comment open and we can address it soon.
I have moved HybridCPDataLoaderWrapper to core/datasets.
I would like to keep BalancedCPScheduler here for now, as we have a part 2 PR that introduces a more flexible scheduler concept where any scheduler can be used (such as different schedules for PP vs. no PP), and we will refactor the hybrid_cp_schedule file then.
Hi @parthmannan, could you also start a main PR?
Added PR for main here - #2282
What does this PR do?
Design document discussed in MCore sync meeting - https://docs.google.com/document/d/1MnIPQ_VbpDNp-adtvcEv-SYx6A8rtt3-fDdxbcdrmk0/edit?usp=sharing
The first issue this PR addresses is the imbalance between DP ranks when using packed sequences (for example, in SFT). While packing sequences helps reduce variability in total sequence length, it does not guarantee equal workload: attention compute is quadratic in sequence length, so a single 1k-token sequence requires 2x the attention compute of a packed sequence made of two 512-token samples. The problem gets much worse with very long sequences and/or large variation between sequence lengths.
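To make the quadratic-cost argument concrete, here is a minimal sketch (the helper name and the simplified cost model are ours, not from this PR) comparing a single 1k sample against a 2x512 pack:

```python
# Minimal sketch of the quadratic attention-cost argument above.
# attention_cost is a hypothetical helper: self-attention compute scales
# with the square of each individual sample's length, so a packed
# sequence costs the sum of squares of its constituent samples.

def attention_cost(sample_lengths):
    """Relative attention compute for one packed sequence."""
    return sum(length ** 2 for length in sample_lengths)

single = attention_cost([1024])      # one 1k sample   -> 1_048_576
packed = attention_cost([512, 512])  # two 512 samples ->   524_288
print(single / packed)               # 2.0: the 1k sample is 2x the work
```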
This PR schedules a variable number of microbatches per rank within the DPxCP group to ensure a balanced workload.
The second issue this PR addresses is redundant CP communication. The context parallel size is chosen based on the full packed sequence length (usually the max sequence length across all samples). For example, if a 1k sequence requires CP2, we apply CP2 to a packed sequence of 2x512 as well. In reality, a 2x512 packed sequence can easily be split across 2 GPUs by separating the two samples, with no CP at all. This PR introduces dynamic context parallelism, where each sample is individually scheduled onto a dynamically sized CP group.
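As one way to picture per-sample CP sizing, here is a minimal sketch assuming a simple heuristic of ours (not necessarily what this PR implements): pick the smallest power-of-two CP size whose per-GPU shard fits a token budget. `max_tokens_per_gpu` and `max_cp_size` are illustrative knobs.

```python
# Hypothetical heuristic for picking a per-sample CP size: grow the CP
# group (powers of two) until each GPU's shard fits the token budget.

def dynamic_cp_size(seq_len, max_tokens_per_gpu, max_cp_size=16):
    cp = 1
    while cp < max_cp_size and seq_len / cp > max_tokens_per_gpu:
        cp *= 2
    return cp

# With a 64k-token budget per GPU:
print(dynamic_cp_size(128 * 1024, 64 * 1024))  # 2: a 128k sample needs CP2
print(dynamic_cp_size(512, 64 * 1024))         # 1: a short sample needs no CP
```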
To achieve the above, we introduce a balanced scheduler and a dataloader wrapper.
The dataloader wrapper collects the metadata that informs the scheduler of the sequence length of each sample across the entire global batch, and it breaks packed sequences back into individual samples, since samples are scheduled individually. Given the metadata, the balanced scheduler assigns samples to ranks (across the DPxCP group) along with a dynamic CP group size. To avoid deadlocks, we divide the schedule into groups (this replaces the notion of microbatches): within each group, every rank is part of a fixed CP group, but ranks may run different numbers of samples so that compute stays balanced across all ranks. A simplified sketch of the balancing step follows below.
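To make the scheduling step concrete, here is a minimal greedy sketch. The function name, the longest-processing-time strategy, and the seq_len**2 cost model are our illustration of the idea; the PR's BalancedCPScheduler may differ, and CP group formation is ignored here.

```python
import heapq

def balance_samples(sample_lengths, num_ranks):
    """Greedy LPT assignment: each sample goes to the least-loaded rank.

    Simplified stand-in for a balanced scheduler: cost is modeled as
    seq_len**2 (attention-dominated). Returns per-rank sample lists;
    ranks may hold different numbers of samples, which is exactly the
    'variable number of samples per rank' idea described above.
    """
    # Min-heap of (current_load, rank_id) so the cheapest rank pops first.
    heap = [(0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    # Placing the most expensive samples first gives a tighter balance.
    for length in sorted(sample_lengths, reverse=True):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(length)
        heapq.heappush(heap, (load + length ** 2, rank))
    return assignment

# Four ranks, a skewed mix of sample lengths: one rank takes the single
# 4k sample while another absorbs four 512-token samples.
print(balance_samples([4096, 1024, 1024, 512, 512, 512, 512], 4))
```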
We have run performance and correctness evaluations of the feature. Using an SFT packed dataset with a max sequence length of 128k and a Llama3 8B dummy model, we see a 3x performance improvement with this feature. While there is room to improve the baseline itself, the speedup should remain in the 2-3x range.
This is what a 128k sequence length with CP16 looks like without this feature. The GPU is bound by CP communication.
[profiler timeline screenshot]
This is what a 128k sequence length with CP16 looks like with this feature. The GPU is bound by attention compute, since all redundant communication has been removed.
[profiler timeline screenshot]
Feature correctness (@xiaoyao0115):
[correctness results screenshot]
This is the first milestone of this feature, and there are many improvements we want to make in future releases.